Red Wine Exploration

by David Vartanian

Abstract

I explore a dataset of 1,599 red wines to understand what drives the quality score assigned to each of them.

Introduction

This dataset is provided by Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis, from different universities in Portugal. It includes measurements such as acidity, residual sugar, chlorides, and alcohol, among others. I explore the data to find patterns and trends and to understand what the given features mean.

Univariate Plots Section

Let’s start with some summary statistics and a few histograms to understand the individual variables.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
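
To make the columns of the summary above concrete, here is a minimal sketch in Python (my choice for illustration; the report does not show its own code) that recomputes the six statistics for X. It assumes X is simply the row index 1 to 1599, which is consistent with the printed minimum, maximum, and mean, and it assumes the quartiles use linear interpolation between order statistics.

```python
from statistics import mean, median, quantiles

# Assumption: X is the consecutive row index 1..1599, as its summary suggests.
x = list(range(1, 1600))

# method="inclusive" interpolates linearly between order statistics; this is
# assumed to be the convention behind the printed quartiles.
q1, _, q3 = quantiles(x, n=4, method="inclusive")

summary = {
    "Min.": min(x),
    "1st Qu.": q1,
    "Median": median(x),
    "Mean": mean(x),
    "3rd Qu.": q3,
    "Max.": max(x),
}

for name, value in summary.items():
    print(f"{name:8s}: {value}")
```

Under those assumptions the quartiles come out as 400.5 and 1199.5, matching the X column of the table.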

Histograms: quality, fixed.acidity, total.sulfur.dioxide, alcohol

These histograms show the distribution of values in the different variables.

Univariate Analysis

Dataset Structure

Leaving aside the row index X, there are 12 variables: 9 continuous, 2 discrete, and one ordered categorical variable, quality.

Main dataset interest

My general question is: how do chemical properties determine the quality of a red wine?

There are interesting features in this dataset, each of them describing an important property of the red wine. In my opinion, density, pH, sulphur dioxide, and sulphates are the most important ones for measuring quality. Let’s see what we can find by looking at those variables.

pH

This variable indicates the acidity level of the wine. The scale goes from 0 (very acidic) to 14 (very basic), but most red wines fall between 3 and 4.

It’s quite surprising that pH levels are lower in both low-quality and high-quality wines than in mid-quality ones.
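
Because pH is a logarithmic scale (pH = -log10 of the hydrogen ion concentration), a gap of about one unit is larger than it looks. The sketch below, in Python, converts the minimum and maximum pH from the summary table into concentrations; the function name is mine and purely illustrative.

```python
def hydrogen_ion_concentration(ph: float) -> float:
    """Return [H+] in mol/L for a given pH, using pH = -log10([H+])."""
    return 10.0 ** (-ph)

# Minimum and maximum pH taken from the summary table.
ph_min, ph_max = 2.74, 4.01

# How many times more hydrogen ions the most acidic wine has than the least.
ratio = hydrogen_ion_concentration(ph_min) / hydrogen_ion_concentration(ph_max)
print(f"pH {ph_min} is about {ratio:.1f} times more acidic than pH {ph_max}")
```

So the most acidic wine in the dataset has between 18 and 19 times the hydrogen ion concentration of the least acidic one.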

Density

The density of the wine is close to that of water and depends on its alcohol and sugar content.

Density levels are also lower in both low-quality and high-quality wines.

Free Sulphur Dioxide

The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion. It prevents microbial growth and the oxidation of the wine.

Again, the levels for this variable are lower in both low-quality and high-quality wines.

Sulphates

An additive that can contribute to sulphur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.

Sulphate levels are lower in both low-quality and high-quality wines as well.

Variable Transformations

There were no missing values to clean in this dataset. However, several variables are skewed, so I applied a log transformation to them.

Transformed Volatile Acidity using log base 10.

Transformed Fixed Acidity using log base 10.

Transformed Total Sulphur Dioxide using log base 10.

Transformed Chlorides using log base 10.

Transformed Residual Sugar using log base 10.

Transformed Sulphates using log base 10.

Transformed Free Sulphur Dioxide using log base 10.
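
To illustrate why a log transformation helps with skewed variables, here is a small Python sketch that uses only the minimum, median, and maximum of residual.sugar from the summary table. It compares how long the upper tail is relative to the lower tail before and after taking log base 10. The `tail_ratio` helper is a crude asymmetry measure I made up for this illustration, not a standard skewness statistic.

```python
import math

# Values copied from the residual.sugar column of the summary table.
residual_sugar = {"min": 0.9, "median": 2.2, "max": 15.5}

def tail_ratio(s):
    """Length of the upper tail divided by the lower tail, from the median."""
    return (s["max"] - s["median"]) / (s["median"] - s["min"])

# log10 is monotonic, so the min, median and max of the transformed data
# are the logs of the original min, median and max.
logged = {name: math.log10(value) for name, value in residual_sugar.items()}

raw_ratio = tail_ratio(residual_sugar)
log_ratio = tail_ratio(logged)
print(f"raw scale:   upper tail is {raw_ratio:.1f}x the lower tail")
print(f"log10 scale: upper tail is {log_ratio:.1f}x the lower tail")
```

The upper tail goes from roughly ten times the lower tail to roughly twice it, which is the kind of compression that makes a long-tailed histogram easier to read.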

Bivariate Plots Section

Let’s try to find trends and interesting patterns by comparing pairs of variables.

Fact: Higher quality wines seem to have higher levels of alcohol

Fact: Higher quality wines seem to have lower levels of acidity

Fact: Higher quality wines seem to have lower density

Citric acid adds freshness and flavor to the wine.

Volatile acidity measures the amount of acetic acid in the wine. At too high a level it produces an unpleasant vinegar taste.

Bivariate Analysis

Relationships

Interesting relationships

Density vs Quality

I’ve found a slightly negative correlation, meaning that density tends to be lower in high-quality wines. However, the correlation is too weak for density alone to determine quality.
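
The sign of a correlation coefficient tells which way two variables move together: a negative value means density falls as quality rises. Here is a minimal Python sketch of the Pearson coefficient. The numbers in it are hypothetical values chosen only to show the sign; they are not taken from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for illustration only, NOT rows from the dataset.
quality = [3, 4, 5, 6, 7, 8]
density = [0.9975, 0.9966, 0.9970, 0.9968, 0.9960, 0.9955]

r = pearson_r(quality, density)
print(f"r = {r:.2f}")  # negative: density goes down as quality goes up
```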

Chlorides vs Sulphates

I’ve found that levels are mostly low for both variables. I would say that they have little influence on quality, since wines of all quality levels show similar levels of both.

Chlorides vs Residual Sugar

I’ve found the same here: both variables stay consistently low.

Free vs. Total Sulphur Dioxide

Levels stay low throughout. However, these two variables seem to be correlated, which is expected because free sulphur dioxide is one component of total sulphur dioxide.

Multivariate Plots Section

Level of durability regarding alcohol and quality.

Level of durability regarding alcohol and density.

Multivariate Analysis

##     quality      mean_quality   mean_alcohol    mean_density   
##  Min.   :3.00   Min.   :3.00   Min.   : 9.90   Min.   :0.9952  
##  1st Qu.:4.25   1st Qu.:4.25   1st Qu.:10.03   1st Qu.:0.9962  
##  Median :5.50   Median :5.50   Median :10.45   Median :0.9966  
##  Mean   :5.50   Mean   :5.50   Mean   :10.72   Mean   :0.9965  
##  3rd Qu.:6.75   3rd Qu.:6.75   3rd Qu.:11.26   3rd Qu.:0.9970  
##  Max.   :8.00   Max.   :8.00   Max.   :12.09   Max.   :0.9975  
##     mean_ph      mean_citric_acid       n         
##  Min.   :3.267   Min.   :0.1710   Min.   : 10.00  
##  1st Qu.:3.294   1st Qu.:0.1915   1st Qu.: 26.75  
##  Median :3.312   Median :0.2588   Median :126.00  
##  Mean   :3.327   Mean   :0.2715   Mean   :266.50  
##  3rd Qu.:3.366   3rd Qu.:0.3498   3rd Qu.:528.25  
##  Max.   :3.398   Max.   :0.3911   Max.   :681.00
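
The table above appears to summarize a per-quality aggregation: one row per quality level, holding the mean of each variable and the number of wines n. The n column fits that reading, since six quality levels with a mean of 266.5 wines each add up to 1599 rows. Here is a minimal Python sketch of that kind of grouping. The rows are hypothetical and only the column names follow the table.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows (quality, alcohol, density), NOT taken from the dataset.
wines = [
    (5, 9.4, 0.9978),
    (5, 9.8, 0.9968),
    (6, 10.5, 0.9964),
    (6, 11.0, 0.9956),
    (7, 12.0, 0.9950),
]

# Collect the measurements of every wine under its quality level.
groups = defaultdict(list)
for quality, alcohol, density in wines:
    groups[quality].append((alcohol, density))

# One summary row per quality level, ordered by quality.
by_quality = []
for quality in sorted(groups):
    rows = groups[quality]
    by_quality.append({
        "quality": quality,
        "mean_alcohol": mean(a for a, _ in rows),
        "mean_density": mean(d for _, d in rows),
        "n": len(rows),
    })

for row in by_quality:
    print(row)
```

Summarizing a table like `by_quality` column by column would then give output with the same shape as the one above.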

Density & pH

A pretty strong correlation can be observed between these two variables across quality levels, meaning that it’s normal to find lower levels of both pH and density in high-quality wines.

Durability

The new variable durability suggests that high-quality wines with a high alcohol content are more likely to keep well, due to the effect of sulphates and free sulphur dioxide.


Final Plots and Summary

Durability & Alcohol

The most remarkable thing this plot shows is that high-quality wines seem to last longer. The orange line in the top-right corner makes the point clearly: they last much longer when the alcohol level is higher.

Citric Acid vs. pH

This is a pretty straightforward correlation. When the pH level gets lower (which means the wine is more acidic), citric acid gets higher. It makes sense, right?

Density by Quality Level

I wanted to emphasize this plot again because density levels look similar for both low-quality and high-quality wines. Put another way, density is higher only in mid-quality wines.


Reflection

I feel that I now have a few extra tips for selecting new wines to taste: look for higher levels of alcohol and acidity, lower density, and low levels of residual sugar, chlorides, and sulphates. The high alcohol and low density were definitely surprising to me. However, I think the dataset needs more categorical variables and much more data to support a better analysis.